Measuring psychological distress using the 12‐item general health questionnaire and the six‐item Kessler psychological distress scale. Psychometric comparison and equipercentile equating of the two scales

Abstract Objectives This study aimed to examine if the General Health Questionnaire (GHQ)-12 and Kessler 6 (K6) assess the same underlying construct and to develop a score conversion table for the two scales. Methods A random sample of 4303 people who completed both the GHQ-12 and K6 in 2021 was analyzed. Exploratory bifactor analysis evaluated if both scales measured the same construct, and Rasch analysis assessed item severities. The scales were transformed using equipercentile equating for comparability and score conversion. Agreement was estimated with Cohen's kappa coefficient, along with raw positive and negative agreement. Results We found that the two scales measure the same phenomenon to the extent that they can be made equivalent. Conversion tables between GHQ-12 and K6 are presented. Applying the commonly used cut-off of ≥3 on the GHQ-12 bi-modal scoring, we found that the best corresponding cut-off on the K6 would be ≥8. The prevalence of psychological distress was then 22% with the GHQ-12 and 21% with the K6. Conclusions The GHQ-12 and K6 measure the same construct, and corresponding cut-off scores on one scale were found for the other scale. This is valuable for longitudinal studies or time series where one scale has replaced the other scale.

score on one scale corresponds to a specific score on the other, and to examine the level of agreement between the scales.

| Study sample
Using stratified random sampling, a total sample of 50,270 individuals aged 16 and older was drawn from 38 municipalities and city districts in Stockholm County in 2021. After excluding persons who were dead or had moved from Stockholm County, the eligible sample was 47,855 individuals. The invitation to participate was sent together with a link to a web survey and was also followed by a postal survey.
The questionnaires consisted of questions about health, lifestyle, risk factors, background information, and psychological distress. The main instrument used to measure psychological distress was the K6 and was part of the main questionnaire. A random sample of those invited to participate (20%) received an extended questionnaire which included the 12 items of the GHQ-12 (n = 9558 in total). A total of 23,072 individuals participated in the survey, 48.2% of the total invited. The sample that answered the questionnaire comprising the GHQ-12 was 4531 individuals (47.4% response rate). In this study, we excluded individuals with missing items on the GHQ-12 or K6 instruments, resulting in an analytic sample of 4303 individuals.

| GHQ-12
The GHQ-12 was constructed based on questions about depression, anxiety, and social impairment, by selecting those that differed most in occurrence between a clinical population (primarily with neurosis) and a control group without reported psychiatric problems (Goldberg, 1972). The instrument comprises questions on how frequently six symptoms concerning feelings/behaviors and six symptoms related to functioning have occurred in the past few weeks. It has both positively and negatively worded items to correct for the tendency to generally answer positively or negatively. The negatively worded questions all have the same response options, namely: (1) not at all, (2) no more than usual, (3) more than usual, and (4) much more than usual. The positively worded questions have the following response options, from "best health" to "worst health", namely: (1) more than usual/better than usual, (2) as usual, (3) worse/less than usual, and (4) much worse/less than usual. The Swedish version of the GHQ-12 was originally translated by Diderichsen and Janlert (1992), and the criterion validity has been examined elsewhere (Lundin et al., 2016, 2017).
Two types of summary scores are used on the GHQ-12. Either the answers are scaled 0-3 and summed (also called Likert scoring, 0 1 2 3), or the individual questions are dichotomized (0-1) and summed (bi-modal scoring, 0 0 1 1). The former can be said to measure intensity and the latter to be a symptom count. We used both scoring methods in this study.
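The two scoring rules can be illustrated with a minimal sketch. It assumes that all 12 responses are already coded 1-4 in the direction described above (1 = best health, 4 = worst health); the function name `ghq12_scores` is hypothetical and is not taken from the study.

```python
from typing import Sequence, Tuple


def ghq12_scores(responses: Sequence[int]) -> Tuple[int, int]:
    """Return (Likert score 0-36, bi-modal score 0-12).

    Assumes 12 responses coded 1-4, where 4 is the "worst health" option.
    """
    if len(responses) != 12 or any(r not in (1, 2, 3, 4) for r in responses):
        raise ValueError("expected 12 responses coded 1-4")
    # Likert scoring: response options are rescored 0 1 2 3 and summed.
    likert = sum(r - 1 for r in responses)
    # Bi-modal scoring: response options are rescored 0 0 1 1 and summed.
    bimodal = sum(1 if r >= 3 else 0 for r in responses)
    return likert, bimodal
```

For example, a respondent choosing option 2 on six items and option 3 on the other six would score 18 with Likert scoring and 6 with bi-modal scoring.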

| Kessler 6
The six-item Kessler scale was developed and published in 2002 (Kessler et al., 2003) to screen for symptoms resembling the Diagnostic and Statistical Manual of Mental Disorders (DSM) symptoms of depression and generalized anxiety in the general population. The K6 has been used in several population surveys, including the WHO's World Mental Health Survey (Kessler et al., 2010). The Swedish version used in both the Stockholm public health surveys, "Hälsa Stockholm", and the National public health survey, "Health on equal terms", is a translation by the Center for Epidemiology and Community Medicine (CES). Both K10 and K6 are free to use without permission (available here: https://www.hcp.med.harvard.edu/ncs/k6_scales.php).
The K6 comprises questions that ask how much of the time in the past 30 days a person has felt nervous; hopeless; restless or fidgety; so depressed that nothing could cheer them up; that everything was an effort; and worthless. Each question has five response options: (1) none of the time, (2) a little of the time, (3) some of the time, (4) most of the time, and (5) all of the time, scored 0 1 2 3 4. The sum of scores is between 0 and 24.
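As a small illustration of the K6 scoring described above, the sketch below sums six responses coded 1-5 and applies a cut-off. The function names and the `cutoff` parameter are hypothetical; the value 8 in the usage note is the cut-off reported in the abstract as corresponding to ≥3 on the GHQ-12 bi-modal scoring.

```python
from typing import Sequence


def k6_score(responses: Sequence[int]) -> int:
    """Sum six K6 responses coded 1-5 into a total score of 0-24."""
    if len(responses) != 6 or any(r not in (1, 2, 3, 4, 5) for r in responses):
        raise ValueError("expected 6 responses coded 1-5")
    # Each response option is rescored 0 1 2 3 4 before summing.
    return sum(r - 1 for r in responses)


def is_case(total: int, cutoff: int) -> bool:
    """Dichotomize a total score: True when the score reaches the cut-off."""
    return total >= cutoff
```

With `cutoff=8`, a total score of 8 is classified as a case and a total score of 7 is not.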

| Procedure and statistical analysis
As a first step in examining the possibility of linking the two scales, we examined the Pearson and Spearman correlations between the scale scores. As a rule of thumb, a correlation of 0.70 is considered a lower bound for linking (Fayers & Hays, 2014). We then proceeded to exploratory bifactor analysis to investigate to what extent the K6 and GHQ-12 measure the same phenomenon/construct. The scree test from Principal Component Analysis (PCA), Horn's Parallel Test (Horn, 1965), Velicer's Minimum Average Partial (MAP) factor retention method (Velicer, 1976), and Revelle's Very Simple Structure (VSS; Revelle & Rocklin, 1979) were used to estimate the number of factors to extract. Unidimensionality is not a requirement for linking, but in order to claim that the results can be used interchangeably (said to be equated), the two scales must reflect the same phenomenon.
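The initial correlation screen can be sketched in plain Python as follows. This is an illustrative implementation under the assumption that the two total scores are available as equal-length lists for the same individuals; the helper names and the `linkable` function are hypothetical, and the 0.70 default mirrors the rule of thumb cited above.

```python
from math import sqrt
from typing import List, Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson product-moment correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def average_ranks(values: Sequence[float]) -> List[float]:
    """Ranks starting at 1, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman correlation: Pearson correlation of the average ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def linkable(x: Sequence[float], y: Sequence[float], threshold: float = 0.70) -> bool:
    """True when both coefficients reach the rule-of-thumb lower bound."""
    return pearson(x, y) >= threshold and spearman(x, y) >= threshold
```

Because sum scores on short scales contain many ties, the Spearman coefficient here uses average ranks rather than the shortcut formula that assumes no ties.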
Next, we investigated differences in severity by examining the distribution of responses in the raw total scores, the severity of each question's response options, and differences in measurement goals. The distribution of raw scores partially reflects the distribution of the phenomenon studied and the design of the scale. Since the population remained constant, differences in distribution are likely to be due to the severity of the items. The Partial Credit Model (PCM), a Rasch model for polytomous questions, was used to investigate the relative severity of each question. This was done by examining the relationship between the probability of choosing a specific response option and a latent (modeled) measure of mental illness. Response option severity was used to construct information curves for the two scales to examine each scale's target with regard to measurement precision.
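To make the link between response options and the latent measure concrete, the sketch below computes PCM category probabilities for a single item using the standard PCM formula. The step thresholds passed in are hypothetical values chosen for illustration, not estimates from this study, and the function name is an assumption.

```python
from math import exp
from typing import List, Sequence


def pcm_probabilities(theta: float, thresholds: Sequence[float]) -> List[float]:
    """Category probabilities under the Partial Credit Model.

    theta: latent trait value of a person.
    thresholds: step parameters delta_1..delta_m of one item
    (hypothetical values in the examples; m + 1 response categories).
    """
    # Category 0 corresponds to an empty sum, hence exp(0) = 1.
    numerators = [1.0]
    cumulative = 0.0
    for delta in thresholds:
        cumulative += theta - delta
        numerators.append(exp(cumulative))
    total = sum(numerators)
    return [value / total for value in numerators]
```

When `theta` equals the first step parameter, the first two categories are equally likely, which is the sense in which a step parameter marks the point where the higher of two adjacent response options becomes the more probable one. Note that these step parameters compare adjacent categories and are not the same quantity as cumulative (Thurstone) thresholds.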
Because the distributions of the two scales were different, we used equipercentile linking to equate the two tests (Kolen & Brennan, 2014), that is, to transform values of one scale into values of the other. This method determines the relative position of scores through percentiles: the transformed values are then adjusted to be the same as the raw scores in the same percentile group. The conversion tables between the GHQ-12 and K6 thus aim to find which score on one instrument corresponds to a certain score on the other instrument, while correlation coefficients indicate the degree to which the two scales generally co-vary. Equipercentile linking belongs to the traditional linking methods and was chosen over "modern" IRT linking methods because (1) it does not require the strong assumptions of IRT (e.g., choosing the correct statistical model and assumptions concerning the distribution of true scores) and (2) with a single group design there was no concern about non-equivalent groups, group equivalence being the principal requirement in traditional linking. With the scales equated, that is, put on the same metric, Bland-Altman figures were used to examine the overall level of agreement. Lastly, since both GHQ-12 and K6 are often dichotomized in use, we compared the agreement at different commonly used cut-off values. The measure of agreement used was Cohen's kappa coefficient, a common measure that adjusts for chance agreement (McHugh, 2012).
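The idea behind equipercentile linking can be sketched as follows. This is a simplified illustration, not the procedure used in the study: it assumes integer scores starting at 0, uses mid-percentile ranks with linear interpolation, and applies no smoothing. The function names are hypothetical.

```python
from bisect import bisect_left
from collections import Counter
from typing import Dict, List, Sequence


def percentile_ranks(scores: Sequence[int], max_score: int) -> List[float]:
    """Mid-percentile rank (0-100) of each integer score 0..max_score."""
    n = len(scores)
    counts = Counter(scores)
    ranks, below = [], 0
    for s in range(max_score + 1):
        # Share scoring below s plus half of the share scoring exactly s.
        ranks.append(100.0 * (below + counts[s] / 2) / n)
        below += counts[s]
    return ranks


def equipercentile_table(
    x_scores: Sequence[int], y_scores: Sequence[int], x_max: int, y_max: int
) -> Dict[int, float]:
    """Map each integer X score to the Y value with the same percentile rank."""
    pr_x = percentile_ranks(x_scores, x_max)
    pr_y = percentile_ranks(y_scores, y_max)
    table = {}
    for x, p in enumerate(pr_x):
        if p <= pr_y[0]:
            table[x] = 0.0
        elif p >= pr_y[-1]:
            table[x] = float(y_max)
        else:
            hi = bisect_left(pr_y, p)  # first Y score whose rank is >= p
            lo = hi - 1
            span = pr_y[hi] - pr_y[lo]
            frac = (p - pr_y[lo]) / span if span > 0 else 0.0
            table[x] = lo + frac  # interpolate between adjacent Y scores
    return table
```

In practice the interpolated Y values would be rounded to obtain an integer conversion table, and sparse score ranges at the extremes would need more careful handling than this sketch provides.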
We also calculated the unadjusted agreement (proportion of total agreement) and the percentages of positive and negative agreement, measures that have their counterpart in sensitivity and specificity.
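These agreement measures can be computed from the 2 x 2 cross-classification of cases on the two scales, as in the sketch below. The definitions of positive and negative agreement used here (twice the concordant count divided by the sum of the two marginal counts) are a common formulation and are an assumption about how the study computed them; the function name is hypothetical.

```python
from typing import Dict, Sequence


def agreement(a: Sequence[int], b: Sequence[int]) -> Dict[str, float]:
    """Agreement between two 0/1 case indicators for the same individuals."""
    n = len(a)
    both = sum(1 for x, y in zip(a, b) if x == 1 and y == 1)
    neither = sum(1 for x, y in zip(a, b) if x == 0 and y == 0)
    only_a = sum(1 for x, y in zip(a, b) if x == 1 and y == 0)
    only_b = sum(1 for x, y in zip(a, b) if x == 0 and y == 1)

    observed = (both + neither) / n  # unadjusted (total) agreement
    p_a = (both + only_a) / n        # prevalence according to scale A
    p_b = (both + only_b) / n        # prevalence according to scale B
    expected = p_a * p_b + (1 - p_a) * (1 - p_b)  # agreement expected by chance

    # Note: the ratios below are undefined when a denominator is zero,
    # for example when nobody is a case on either scale.
    return {
        "observed": observed,
        "kappa": (observed - expected) / (1 - expected),
        "positive": 2 * both / (2 * both + only_a + only_b),
        "negative": 2 * neither / (2 * neither + only_a + only_b),
    }
```

Kappa discounts the agreement that two scales with the given prevalences would reach by chance, which is why it can be moderate even when total agreement is high at low prevalence.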

| Do GHQ-12 and K6 measure the same phenomenon?
The Pearson and Spearman correlation coefficients were 0.77, meaning that 59% of the variance on one scale is explained/predicted by the other scale. The distribution of responses in the population (Figure 1) shows that both GHQ-12 (left graph) and K6 (right graph) have asymmetric distributions, with a positive skew. The skew is more obvious in K6.
To further investigate possible covariation structures, several tests were performed on the GHQ-12 and K6 jointly. The first three eigenvalues from a PCA of the items from both scales were: 8.75, Based on these test results, we proceeded on the basis that the GHQ-12 and K6 contained a strong dominant component, but that there is possibly another covariation pattern in the data. Bifactor analysis was therefore performed to examine the unidimensionality and size of the general and two additional specific factors. The results are presented in Table 1.
The bifactor analysis clearly differentiated between the two scales; the specific factors consisted of questions from the respective scales. However, none of the questions had a strong loading on the specific factors. Omega Hierarchical was 0.77, which is interpreted as 77% of the variance in raw scores being due to the general factor. ECV, a measure of the common variance explained by the general factor, was of similar strength (0.70). Most of the variance was thus explained by the general factor. The questions that contributed the most to the general factor (factor loadings) were as follows: for K6, questions 2 (hopelessness) and 4 (depressed); and for GHQ-12, questions 9 (depressed), 10 (hopelessness), and 11 (uselessness). Notably, all these questions were about symptoms associated with depression. Given that the two scales were more similar than different (i.e., unidimensional), the next step was to investigate design effects from response options by fitting the PCM. Figure 2 shows a Wright map plotting the Thurstone thresholds of the K6 and GHQ-12 item responses against the latent trait computed for the items. Each point represents where the transition (50/50 odds) to higher response alternatives occurs.
The first circle (on the left of the figure) shows where the second answer option in each question transitions to being more likely than the first, the second point where the third option is more common than others, and so on.It was evident that six of the GHQ-12 questions had responses that were used by those with relatively low scores on the latent scale, which was a difference between the scales.These six questions were the GHQ-12 questions that are positively worded.

FIGURE 1 Proportional distribution of scores on the GHQ-12 (left) and K6 (right). GHQ, general health questionnaire; K6, Kessler 6.

TABLE 1 Factor loadings from bifactor analysis.

2 and 3) show which specific scores of the K6 equate to scores of the GHQ-12. These values can be used to monitor prevalence and trends over time, also in studies where one scale has replaced the other. However, it should be noted that those who were identified as cases on one scale were not exactly the same individuals that were identified as cases on the other scale.

GHQ-12 and K6 differ in number of questions, response options and to some extent content, which means that the scales may function differently. The correlation between the two scales, using the Pearson and Spearman methods, could be classified as just below strong (r = 0.77), but above what is often considered to be acceptable for linking (r = 0.70) (Fayers & Hays, 2014). The factor analysis indicated that the scales had more in common than they differed, but the GHQ-12 tended to show a more complex structure. This is in agreement with previous studies on the K6 (Kessler et al., 2010) and GHQ-12 (Hankins, 2008; Werneke et al., 2000). The differences are likely because the K6 was developed primarily with factor analysis in order to achieve unidimensionality, while the GHQ-12 was developed with the primary aim to distinguish patients from non-patients.
The dimensionality of the GHQ-12 has been the focus of several studies. Exploratory factor analyses have suggested one dimension as well as 2-3 dimensions, but a meta-analytic study (Gnambs & Staufenbiel, 2018) concluded that the GHQ-12 was essentially unidimensional, with an ECV from bifactor analysis of 79%, leaving little variance to subdomains, which is in concert with our ECV of 70% for GHQ-12 and K6 combined. A Rasch analysis of a longer version of the scale, the GHQ-30, found that the dimensionality does not necessarily come from the content of the questions, but is a result of the questions having different answer options (Andrich & Van Schoubroeck, 1989). Like our study, that study found that for positively worded items, more response options were used among those

TABLE 2 Conversion table between GHQ-12 (Likert scoring, 0-36) and K6.

FIGURE 3 GHQ-12 total scores (Likert, x-axis) and corresponding K6 total score (y-axis). GHQ, general health questionnaire; K6, Kessler 6.

TABLE 4 Prevalence and agreement between cut-off values between GHQ-12 (Likert scoring, 0-36) or GHQ-12 (bi-modal scoring, 0-12) and K6 (0-24).

this depends on the purpose of the study and the base rate.

The agreement between the GHQ-12 and K6 when cut-offs are calibrated is in line with that found for, for example, self-assessed depression scales when compared with diagnostic interviews (Eaton et al., 2007).

| Strengths and limitations
A limitation of the study is that the participants were not assessed with a structured psychiatric interview. Psychiatric interviews are generally considered the gold standard when determining whether a person has a diagnosis or not and are therefore used to test questionnaires' cut-offs for diagnoses. Thus, this study cannot answer which specific cut-off values on the GHQ-12 or the K6 best correspond to such case finding. It should be noted that distress, the phenomenon targeted, is measured to screen for disorders but may also be used as a continuous measure in its own right, without cut-offs. The single group design, specifically the volume of questionnaires administered, may have contributed to rater fatigue. Moreover, the order of the scales was not randomized and the GHQ-12 was presented last.
The potential effect of order could not be evaluated. Another potential limitation is that individuals with severe mental illness (e.g., in inpatient care) may be less likely to respond to surveys and could therefore be less represented in the sample than in the community.

| CONCLUSIONS
The GHQ-12 and K6 are measures of the same phenomenon, distress, but with different score distributions. Scale scores of the two can be linked, showing which score on one scale best agrees with a given score on the other, which is useful for comparisons. Linking the scale scores does not imply total agreement, but the moderate agreement is on par with that of diagnostic self-assessment scales.
were used to estimate the number of factors to extract. The bifactor analysis was used to investigate how much of the explained variance is common and how much of the variance measured is due to something unique. Further, we tested to what degree each question contributes to the common factor using bifactor analysis. Factor loadings and variance based on Omega Hierarchical (Omega H) and Explained Common Variance (ECV) are presented to indicate the unidimensionality. While unidimensionality is relative, high values on Omega H and ECV indicate that from a measurement perspective there is a dominating general dimension in the data.
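As an illustration of how these two indices follow from a bifactor solution, the sketch below computes ECV and Omega H from standardized loadings using their usual definitions. It assumes orthogonal general and specific factors, so that each item's unique variance is one minus its squared loadings; the function name and the loadings in the example are hypothetical and are not the values in Table 1.

```python
from typing import Dict, Sequence


def bifactor_indices(
    general: Sequence[float], specific: Sequence[Sequence[float]]
) -> Dict[str, float]:
    """ECV and Omega Hierarchical from standardized bifactor loadings.

    general: loading of each item on the general factor.
    specific: one list per specific factor, with 0.0 for items that do
    not load on that factor. Orthogonal factors are assumed.
    """
    general_ss = sum(g ** 2 for g in general)
    specific_ss = sum(s ** 2 for factor in specific for s in factor)
    # ECV: share of the common variance carried by the general factor.
    ecv = general_ss / (general_ss + specific_ss)

    # Unique variance of each item under orthogonal factors.
    uniqueness = [
        1.0 - g ** 2 - sum(factor[i] ** 2 for factor in specific)
        for i, g in enumerate(general)
    ]
    general_part = sum(general) ** 2
    specific_part = sum(sum(factor) ** 2 for factor in specific)
    total_variance = general_part + specific_part + sum(uniqueness)
    # Omega H: share of total-score variance due to the general factor.
    omega_h = general_part / total_variance
    return {"ecv": ecv, "omega_h": omega_h}
```

With four items loading 0.7 on the general factor and 0.3 on one of two specific factors, this gives an ECV of about 0.84 and an Omega H of about 0.77.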
and 0.59, indicating a dominant factor and possibly an additional one (based on eigenvalue >1). Horn's Parallel test, which adjusts for measurement error through simulations, also suggested two factors. Revelle's Very Simple Structure test suggested a minimum of one factor and a maximum of two (complexity = 0.93 and 0.95, respectively).
Strengths of the study include the large sample size of 4303 individuals assessed with parallel tests and the random sampling of individuals in the Stockholm region. The study had a comparatively high response rate (47%), which is also a strength. The use of equipercentile equating is yet another strength, as it enabled comparison of scores even though the relation between the scales was not linear.

TABLE 4 (column headings) GHQ-12 (bi-modal scoring) cut-off values; GHQ-12 (bi-modal scoring) prevalence; K6 cut-off values; K6 prevalence; Kappa coefficient; Positive agreement; Negative agreement.
Thus, the more complex structure of the GHQ-12 may be due to design (different types of response options), rather than content of the items. Cut-off recommendations are commonly based on sensitivity and specificity, but while studies often make claims of certain sensitivity and specificity being optimal (based on, e.g., Youden's index or the highest sum of sensitivity and specificity), this depends on the purpose of the study and the base rate.